05_6_A_Point_Forecast_Accuracy
References
NOTE: the following material is not fully original, but rather an extension of reference 1 with comments and other sources to help students understand the concepts.
Hyndman, R. J., & Athanasopoulos, G. (2021). Forecasting: Principles and Practice. 3rd edition.
Fable package documentation.
Hyndman, R. J., & Koehler, A. B. (2006). Another look at measures of forecast accuracy. International Journal of Forecasting, 22(4), 679–688.
Packages
library(fpp3)
library(patchwork) # Necessary to produce combinations of graphs
# E.g. when we do p1 + p2 (with p1 and p2 being ggplot objects)
Short general intro: absolute and relative errors
Let \(x\) and \(\hat{x}\) be real numbers, with \(\hat{x}\) being an approximation of \(x\). There are essentially two ways to quantify the error of \(\hat{x}\) when approximating \(x\):
- differences
- ratios
Absolute error (based on differences)
In this case we add the absolute value function because we want to deal with a distance and we do not want positive and negative errors to even out when averaging.
\(E_{abs}(\hat{x}) = |x-\hat{x}|\)
Note that this error has the same units as \(x\).
- E.g. if \(x\) is in USD, the error is in USD.
Relative error (ratios)
\(E_{rel} =\frac{|x-\hat{x}|}{|x|}\)
Note that this measure of the error is dimensionless. We have divided by the magnitude we wanted to approximate and therefore the units in the numerator and the denominator cancel each other out.
Other errors
There are other possible error metrics (e.g. scaled errors), but the underlying idea is always the same: either ratios or differences.
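These two notions can be illustrated with a couple of lines of base R (a minimal sketch; the values of \(x\) and its approximation are made up for illustration):

```r
# A true value and an approximation of it (illustrative numbers)
x     <- 120
x_hat <- 110

# Absolute error: same units as x (e.g. USD)
E_abs <- abs(x - x_hat)

# Relative error: dimensionless, often reported as a percentage
E_rel <- abs(x - x_hat) / abs(x)

c(E_abs = E_abs, E_rel = E_rel)
```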
Forecast errors vs residuals
- Residuals are calculated on the training set. Forecast errors are calculated on the test set.
- Residuals are based on one-step forecasts (the fitted values), while forecast errors can involve multi-step forecasts.
Remember the summary figure I have insisted upon:
Forecast error: difference between an observed value and its forecast.
Error \(\neq\) mistake. Think of it as the “unpredictable part of an observation”: that which the model is not able to capture.
- Training data: \(\{y_1,\dots,y_T\}\)
- Test data: \(\{y_{T+1}, y_{T+2},\dots, y_{T+h}\}\)
The figure below further summarizes this:
Metrics to summarize forecast errors
We can take the set of forecast errors (or, equally, the set of residuals) and produce different summary metrics. In simpler words, we take the error at each point and compute overall summaries of those errors.
Scale-dependent error metrics
These summary error metrics are on the same scale as the data, because they use errors based on differences:
\[ e_t = y_t - \hat{y_t} \] They are therefore scale-dependent and cannot be used to make comparisons between series that involve different units.
Mean Absolute error (MAE)
We take the absolute value of the errors to prevent positive and negative errors from canceling each other out when averaging:
\[\begin{align*} \text{Mean absolute error: MAE} & = \text{mean}(|e_{t}|) = \text{mean}(|y_t - \hat{y_t}|) \end{align*}\]
- Easy to interpret
The expression for the MAE takes a slightly different form depending on whether we are working with forecast errors or with the model residuals:
Expression of the MAE for the model residuals and for forecast errors:
- For model residuals:
- T is the length of the residuals (the length of the vector of residuals, after removing NAs).
\[ \begin{align*} \text{mean}(|e_{t}|) = \frac{1}{T}\sum_{t=1}^{t=T}|y_t - \hat{y_t}| \end{align*} \]
- For forecast errors:
- h is the forecast horizon
\[ \begin{align*} \text{mean}(|e_{t}|) = \frac{1}{h}\sum_{j=1}^{j=h}|y_{T+j} - \hat{y}_{T+j}| \end{align*} \]
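Both expressions are simply averages of absolute errors; only the set of errors being averaged changes. A quick base-R sketch with a made-up error vector (it could contain either residuals or forecast errors):

```r
# Made-up errors (after removing NAs)
e <- c(2, -3, 1, -4)

# MAE: mean of the absolute errors
MAE <- mean(abs(e))
MAE # (2 + 3 + 1 + 4) / 4 = 2.5
```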
MAE minimisation and forecasts on the median
- IMPORTANT: a forecast method that minimizes the MAE will lead to forecasts of the median of the RV we want to predict (\(y_t\))
- PROOF: will be provided in a separate video/notebook.
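While the proof is deferred, the claim is easy to check numerically: if we forecast a whole sample with a single constant, the constant that minimizes the MAE is the sample median. A sketch with made-up, skewed data:

```r
y <- c(1, 2, 3, 10, 50)            # made-up, skewed sample
mae_of <- function(k) mean(abs(y - k))

grid <- seq(0, 60, by = 0.01)      # candidate constant forecasts
best <- grid[which.min(sapply(grid, mae_of))]

c(best = best, median = median(y)) # both are 3
```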
Root Mean Squared Error (RMSE)
We square the errors for two reasons:
- The resulting vector of errors has only positive elements. This prevents positive and negative errors from cancelling each other out when averaging.
- The bigger the error, the more it is amplified by the squaring operation. This is an easy way to assign a bigger weight to large errors than to small errors.
\[ \text{Root mean squared error: RMSE} = \sqrt{\text{mean}(e_{t}^2)} = \sqrt{\text{mean}((y_t - \hat{y_t})^2)} \]
Expression of the RMSE for the model residuals and for forecast errors:
- For model residuals:
- T is the length of the residuals (the length of the vector of residuals, after removing NAs).
\[ \sqrt{\text{mean}(e_{t}^2)} = \sqrt{\frac{1}{T}\sum_{t=1}^{t=T}(y_t - \hat{y_t})^2} \]
- For forecast errors: now there are no equations limiting the degrees of freedom. Therefore:
- h is the forecast horizon
\[ \sqrt{\text{mean}(e_{t}^2)} = \sqrt{\frac{1}{h}\sum_{j=1}^{j=h}(y_{T+j} - \hat{y}_{T+j})^2} \]
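A base-R sketch with a made-up error vector, which also shows how squaring gives the larger errors more weight than the MAE does:

```r
e <- c(2, -3, 1, -4)      # made-up errors

RMSE <- sqrt(mean(e^2))   # sqrt((4 + 9 + 1 + 16) / 4)
MAE  <- mean(abs(e))

c(RMSE = RMSE, MAE = MAE) # RMSE (about 2.74) > MAE (2.5)
```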
RMSE minimisation and forecasts on the mean:
- IMPORTANT: a forecast method that minimizes the RMSE will lead to forecasts of the mean of the RV we want to predict (\(y_t\))
- PROOF (TODO)
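As with the MAE and the median, this can be checked numerically even before seeing the proof: the single constant forecast that minimizes the (R)MSE over a sample is the sample mean. A sketch with made-up data:

```r
y <- c(1, 2, 3, 10, 50)        # made-up sample
mse_of <- function(k) mean((y - k)^2)

# Minimize the MSE over candidate constants (stats::optimize)
best <- optimize(mse_of, interval = c(0, 60))$minimum

c(best = best, mean = mean(y)) # both approximately 13.2
```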
Percentage errors (relative errors)
We scale the error at each point by the actual value of the series at that point. Note that these are relative errors:
\[ p_{t} = 100 \frac{e_{t}}{y_{t}} = 100 \frac{y_t - \hat{y_t}}{y_t} \]
- Unit-free: can be used to compare forecast performance between data sets
- Drawbacks
Percentage errors become infinite or undefined if \(y_t = 0\) or close to zero
Percentage errors assume the unit of measurement has a meaningful zero:
- e.g. percentage errors make no sense when measuring the accuracy of temperature forecasts on either the Fahrenheit or Celsius scales, because these scales have an arbitrary zero point. Therefore scaling by \(y_t\) (in this case the temperature) is meaningless, since the 0 of the scale has been arbitrarily set and therefore the \(y_t\), the measure used to scale our errors, is itself arbitrary.
Percentage errors are often said to impose a heavier penalty on negative errors than on positive errors. This is arguable. Let us consider the following example:
- Case 1: \(y_t = 150\) ; \(\hat{y_t} = 100\)
\[ \frac{|y_t-\hat{y_t}|}{y_t}\times 100 =\frac{|150-100|}{150}\times 100 = 33.33\% \]
- Case 2: \(y_t = 100\) ; \(\hat{y_t} = 150\)
\[ \frac{|y_t-\hat{y_t}|}{y_t}\times 100 =\frac{|100-150|}{100}\times 100 = 50\% \]
Indeed case 2 (negative error) results in a greater percentage error. However, the reason for the heavier penalty is not that the error is positive or negative, but that the actual value has changed. Let us consider an example with positive and negative errors of the same magnitude where the actual value remains constant:
- Case 3: \(y_t = 100\) ; \(\hat{y_t} = 150\)
\[ \frac{|y_t-\hat{y_t}|}{y_t}\times 100 =\frac{|100-150|}{100}\times 100 = 50\% \]
- Case 4: \(y_t = 100\) ; \(\hat{y_t} = 50\)
\[ \frac{|y_t-\hat{y_t}|}{y_t}\times 100 =\frac{|100-50|}{100}\times 100 = 50\% \]
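The four cases above can be reproduced in a couple of lines of R:

```r
# Absolute percentage error at a single point
ape <- function(y, y_hat) 100 * abs(y - y_hat) / y

ape(150, 100) # Case 1: 33.33...
ape(100, 150) # Case 2: 50
ape(100, 150) # Case 3: 50 (same numbers as Case 2)
ape(100,  50) # Case 4: 50
```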
Further discussion in ref [3] (unfortunately in German). Reference [4] is very good as well.
MAPE: Mean Absolute Percentage Error
We take the absolute value of the errors to prevent positive and negative errors from canceling each other out when averaging:
\[ \text{Mean absolute percentage error: MAPE} = \text{mean}(|p_{t}|) \]
sMAPE: “symmetric” Mean Absolute Percentage Error
This section is included for completeness, but sMAPE is not part of the subject or the exam.
\[ \text{sMAPE} = \text{mean}\left(200|y_{t} - \hat{y}_{t}|/(y_{t}+\hat{y}_{t})\right) \]
- Supposedly solves the problem of MAPE penalizing negative errors more.
- Value can be negative (not really a measure of “absolute percentage errors”)
- Does not solve the division by 0 problem (if \(y_t\) is close to 0, \(\hat{y_t}\) is also likely close to 0)
- Use not recommended. Included because used in some contexts.
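For illustration only (again, sMAPE is not part of the subject), here is the sMAPE term for the over- and under-forecast cases discussed earlier. Note that it is not truly symmetric either: with the same absolute error, the under-forecast is penalized more heavily:

```r
# sMAPE contribution of a single point
smape_term <- function(y, y_hat) 200 * abs(y - y_hat) / (y + y_hat)

smape_term(100, 150) # over-forecast:  200 * 50 / 250 = 40
smape_term(100,  50) # under-forecast: 200 * 50 / 150 = 66.67
```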
Example 1 of error metrics
We will work with the Australian quarterly beer production:
recent_production <- aus_production %>%
filter(year(Quarter) >= 1992)
recent_production %>% autoplot(Beer)
Fitting the models to the training dataset and generating forecasts
# Create training dataset
beer_train <- recent_production %>%
filter(year(Quarter) <= 2007)
beer_fit <- beer_train %>%
model(
Mean = MEAN(Beer),
`Naive` = NAIVE(Beer),
`Seasonal naive` = SNAIVE(Beer),
Drift = RW(Beer ~ drift())
)
beer_fc <- beer_fit %>%
forecast(h = 10)
beer_fc %>%
autoplot(
aus_production %>% filter(year(Quarter) >= 1992),
level = NULL
) +
labs(
y = "Megalitres",
title = "Forecasts for quarterly beer production"
) +
guides(colour = guide_legend(title = "Forecast"))
Residual errors: errors on the training dataset
To compute the errors on the training dataset (that is, the residual errors), we use the accuracy() function on the mable containing the fitted models:
summary_tr <- beer_fit %>%
accuracy() %>%
select(.model, .type, MAE, RMSE, MAPE, MASE, RMSSE) %>%
arrange(MAE) # Order from smallest to largest MAE
summary_tr
# A tibble: 4 × 7
.model .type MAE RMSE MAPE MASE RMSSE
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Seasonal naive Training 14.3 16.8 3.31 1 1
2 Mean Training 35.2 43.6 7.89 2.46 2.60
3 Naive Training 54.7 65.3 12.2 3.83 3.89
4 Drift Training 54.8 65.3 12.2 3.83 3.89
The column .type indicates that we have computed these metrics on the training dataset, hence they are based on the residuals.
Interpretation of MAE and RMSE
The MAE and RMSE have the same units as the original time series, as we saw earlier in this notebook; that is, megalitres of beer.
In this case the seasonal naïve model is best both in terms of MAE and in terms of RMSE. In a more general scenario one model could result in a better MAE and another in a better RMSE. Remember we discussed that:
- Picking the model that minimizes the MAE will lead to forecasts on the median.
- Picking a model that minimizes the RMSE will lead to forecasts on the mean.
For the seasonal naïve model, the MAE of roughly 14.3 indicates that, on average, its fitted values are off by 14.3 megalitres of beer (in absolute terms).
The RMSE is more difficult to interpret. We square the errors instead of taking absolute values, which also makes all errors positive so that they do not cancel out when averaging, but additionally gives large errors a heavier weight. Using this metric, the fitted values of the seasonal naïve model show a typical error of about 16.8 megalitres.
Further down in this notebook we will compute the values above manually to understand how they were obtained.
Remember that the metrics above have been computed on the training dataset. To really judge the performance of a model we need to expose it to data it has not seen during training, that is, to test data.
Interpretation of MAPE
The MAPE is a scale-free error because, at each point, we have computed the percentage error, scaling the error by the value of the time series at that point. Remember we saw earlier:
\[ p_{t} = 100 e_{t}/y_{t} \]
\[ \text{Mean absolute percentage error: MAPE} = \text{mean}(|p_{t}|) \]
The MAPE of 3.31 of the seasonal naïve model means that, on average, its fitted values are off by 3.31 percent from the actual time series (in absolute value terms).
Forecast errors: errors on the test dataset
To compute the errors on the test dataset (that is, the forecast errors), we use the accuracy() function on the fable (forecast table) obtained when generating the forecasts. We also need to feed the whole dataset to the accuracy() function, that is, the dataset containing both the training and the test data:
summary_test <- beer_fc %>%
accuracy(recent_production) %>%
select(.model, .type, RMSE, MAE, MAPE, MASE, RMSSE) %>%
arrange(MAE) # Order from smallest to largest MAE
summary_test
# A tibble: 4 × 7
.model .type RMSE MAE MAPE MASE RMSSE
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Seasonal naive Test 14.3 13.4 3.17 0.937 0.853
2 Mean Test 38.4 34.8 8.28 2.44 2.29
3 Naive Test 62.7 57.4 14.2 4.01 3.74
4 Drift Test 64.9 58.9 14.6 4.12 3.87
The column .type indicates that these error metrics are based on the test dataset, on how well the forecasts of our model compare to the test data.
From the table above we can see that the seasonal naïve outperforms all other models on this particular test data by every metric. That is, all the error metrics computed are smaller for the seasonal naïve model.
Normally the situation will not be so clear: one model will perform better on one metric, while another will perform better on another. In these cases we will need to choose which metric we would like to minimize (e.g. minimizing the RMSE leads to forecasts on the mean, while minimizing the MAE leads to forecasts on the median).
In these cases it will be important to understand in detail the meaning of each method and our goal when producing forecasts.
Interpretation of MAE and RMSE
The interpretation is the same as in the case of the residuals, only now the metrics refer to forecast errors (test set) instead of residuals (training set).
Interpretation of MAPE:
The interpretation is the same as in the case of the residuals, only now it refers to forecast errors (test set) instead of residuals (training set).
Manual computation: compute the MAE and RMSE manually for the drift model in the example above both for the training dataset and for the test dataset. Compare it to the output of the accuracy() function:
Using accuracy() provides you with the values, but I want you to truly understand how they have been computed. And there is only one way to do that: let us do it ourselves :)
The example below does precisely this.
- For the training dataset:
# 1. Extract the residuals produced by the drift model from beer_fit using
# augment(), filter() and pull()
resid <- beer_fit %>%
select(Drift) %>%
augment() %>%
pull(.innov) %>%
na.omit() #remove NAs
# 2. Compute the MAE and RMSE using their respective formulas
MAE_manual <- mean(abs(resid))
RMSE_manual <- sqrt(mean((resid)^2))
# 3. Compare with the values computed by fable
MAE_fable <- summary_tr %>% filter(.model == "Drift") %>% pull(MAE)
RMSE_fable <- summary_tr %>% filter(.model == "Drift") %>% pull(RMSE)
all.equal(MAE_manual, MAE_fable)
[1] TRUE
all.equal(RMSE_manual, RMSE_fable)
[1] TRUE
- For the test dataset:
# 1. Extract the forecasts produced by the drift model from beer_fc
# using filter() and pull()
fc <- beer_fc %>% filter(.model == "Drift") %>% pull(.mean) %>% na.omit()
# 2. Extract the true values of the test data from recent_production
# using filter() and pull()
tv <- recent_production %>% select(Quarter, Beer) %>%
filter(Quarter >= yearquarter("2008 Q1")) %>% pull(Beer) %>% na.omit()
# 3. Compute the errors as the difference between true values and forecasts
err <- tv-fc
# 4. Compute the MAE and RMSE averaging errors
RMSE_manual <- sqrt(mean(err^2))
MAE_manual <- mean(abs(err))
# 5. Compare with the values computed by the function accuracy()
RMSE_fable <- summary_test %>% filter(.model == "Drift") %>% pull(RMSE)
MAE_fable <- summary_test %>% filter(.model == "Drift") %>% pull(MAE)
all.equal(RMSE_manual, RMSE_fable)
[1] TRUE
all.equal(MAE_manual, MAE_fable)
[1] TRUE
Example 2 of error metrics with non-seasonal data
In this example I do not go into as much detail on the interpretation of MAE, RMSE and MAPE, because it does not change much.
We are going to compute errors on the Google stock data, taking 2015 as the training dataset and testing on data from January 2016. This time, however, we are not going to compute the values manually; we will just use the fable library.
# Data filtering
google_stock <- gafa_stock %>%
filter(Symbol == "GOOG", year(Date) >= 2015) %>%
mutate(day = row_number()) %>%
update_tsibble(index = day, regular = TRUE)
google_2015 <- google_stock %>% filter(year(Date) == 2015)
google_jan_2016 <- google_stock %>%
filter(yearmonth(Date) == yearmonth("2016 Jan"))
# Fit model
google_fit <- google_2015 %>%
model(
Mean = MEAN(Close),
`Naïve` = NAIVE(Close),
Drift = RW(Close ~ drift())
)
# Generate forecasts
google_fc <- google_fit %>%
forecast(google_jan_2016)
google_fc %>%
autoplot(bind_rows(google_2015, google_jan_2016),
level = NULL) +
labs(y = "$US",
title = "Google closing stock prices from Jan 2015") +
guides(colour = guide_legend(title = "Forecast"))
Residual errors: errors on the training dataset
summary_tr <- google_fit %>%
accuracy() %>%
select(.model, .type, MAE, RMSE, MAPE, MASE, RMSSE) %>%
arrange(MASE) # Order from smallest to largest MASE
summary_tr
# A tibble: 3 × 7
.model .type MAE RMSE MAPE MASE RMSSE
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Naïve Training 7.13 11.2 1.17 1 1
2 Drift Training 7.16 11.1 1.18 1.00 0.996
3 Mean Training 71.6 81.9 11.7 10.0 7.32
We can see that the naïve and the drift methods are performing best on the training dataset (i.e. they have the smallest residuals).
Forecast errors: errors on the test dataset
summary_test <- google_fc %>%
accuracy(google_stock) %>%
select(.model, .type, RMSE, MAE, MAPE, MASE, RMSSE) %>%
arrange(RMSE) # Order from smallest to largest RMSE
summary_test
# A tibble: 3 × 7
.model .type RMSE MAE MAPE MASE RMSSE
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Naïve Test 43.4 40.4 5.67 5.67 3.88
2 Drift Test 53.1 49.8 6.99 6.99 4.74
3 Mean Test 118. 117. 16.2 16.4 10.5
For this particular dataset and split of train and test data, the naïve model seems to outperform the rest of the models on all metrics.
Exercise 1
Take the retail series we have used in the previous notebook. The code below filters the series from aus_retail and creates a training dataset that leaves out 36 months (3 years) for testing purposes:
# Extract series
retail_series <- aus_retail %>%
filter(`Series ID` == "A3349767W")
# Create train dataset
n_test_years = 3
retail_train <-
retail_series %>%
slice(1:(n()-12*n_test_years))
# Check difference = 36
nrow(retail_series) - nrow(retail_train)
[1] 36
Now let us fit the same decomposition_model() we used previously, but this time only to the training dataset:
retail_fit <-
retail_train %>%
model(stlf = decomposition_model(
# 1. Define model to be used for the decomposition
STL(log(Turnover) ~ trend(window = 7), robust = TRUE),
# 2. Define model for the seasonally adjusted component
RW(season_adjust ~ drift())
))
retail_fit
# A mable: 1 x 3
# Key: State, Industry [1]
State Industry stlf
<chr> <chr> <model>
1 Northern Territory Clothing, footwear and personal … <STL decomposition model>
And let us produce forecasts for 3 years:
retail_fc <-
retail_fit %>%
forecast(h = 36)
autoplot(retail_fc, retail_series, level = 80, point_forecast = lst(mean, median))
Warning in ggplot2::geom_point(mapping = mapping, data =
dplyr::semi_join(object, : Ignoring unknown aesthetics: linetype